
feat: Data quality measures #420

Draft
wants to merge 2 commits into master

Conversation

@nickevansuk (Contributor) commented on Apr 5, 2023

This PR introduces data quality measures, based on validator results.

Measures are defined by “exclusions”, which are references to specific types of validator errors. When a measure is calculated, an item is counted towards the measure's total unless it is “excluded” by one of the validator errors referenced by its “exclusions”.

For example:

{
  name: 'Has a name',
  description: 'The name of the opportunity is essential for a participant to understand what the activity is',
  exclusions: [
    {
      errorType: [
        ValidationErrorType.MISSING_REQUIRED_FIELD,
      ],
      targetFields: {
        Event: ['name'],
        FacilityUse: ['name'],
        IndividualFacilityUse: ['name'],
        CourseInstance: ['name'],
        EventSeries: ['name'],
        HeadlineEvent: ['name'],
        SessionSeries: ['name'],
        Course: ['name'],
      },
    },
  ],
},

In the above measure, an item will not be counted towards the total and percentage if the validator error “MISSING_REQUIRED_FIELD” is present for the target field “name”.
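For illustration, here is a minimal sketch of how a measure could be evaluated against per-item validator results under this model. The function name, the shape of the per-item results, and the error properties (error.type, error.field) are illustrative assumptions, not the PR's actual API:

// Hypothetical sketch: count items that are not "excluded" by any matching validator error.
function evaluateMeasure(measure, itemsWithErrors) {
  let counted = 0;
  for (const { item, errors } of itemsWithErrors) {
    // An item is excluded if any error matches an exclusion's errorType
    // and one of the exclusion's targetFields for the item's @type.
    const excluded = measure.exclusions.some((exclusion) => errors.some((error) => (
      exclusion.errorType.includes(error.type)
      && (exclusion.targetFields[item['@type']] || []).includes(error.field)
    )));
    if (!excluded) counted += 1;
  }
  const total = itemsWithErrors.length;
  return {
    name: measure.name,
    count: counted,
    total,
    percentage: total === 0 ? 0 : Math.round((counted / total) * 100),
  };
}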

The advantage of this approach is that the complex inheritance rules respected by the validator are implicitly taken into account, and that more complex validation rules such as activity list matching are easily included without any duplicated logic. Tests can also easily be written for complex rules, as the validator already provides a framework for this.

This increases maintainability, flexibility, and consistency of results across tools. The approach is also extensible, and encourages the creation of new data quality rules in the validator as data quality measures become more in-depth: this has the advantage of surfacing errors at a more detailed level within the various OA tools, as well as providing a high-level summary.

Measures are defined within “profiles”, which allows subsets of measures to be defined distinctly for different use cases (e.g. accessibility).
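As a rough sketch, a profile might simply group a named subset of measures; the object shape and names below are illustrative assumptions rather than the PR's actual structure:

// Hypothetical profile grouping measures for a specific use case.
const accessibilityProfile = {
  name: 'Accessibility',
  measures: [
    hasNameMeasure,                  // e.g. the 'Has a name' measure above
    hasAccessibilitySupportMeasure,  // illustrative only
  ],
};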

Measures are defined within this repository, so that they can be used within both the Validator GUI and the Test Suite, and be maintained alongside the validation rules on which they depend.

(Note that this PR is in draft, and requires some refactoring and tidying up before merging)

Screenshot of unstyled results below:
[Screenshot 2023-04-05 at 09 52 21]

Open questions:

  • How would we ideally display this visually to users? (The output is a simple mustache template; rough design spreadsheet here)
  • Do we need to think about combining parent and child in the feed within the test suite for a more accurate assessment of e.g. the url? (Less relevant for the current measures, which are mostly based on required fields)

@howaskew commented on Apr 5, 2023

Here's an example output from my work via the visualiser...

[Screenshot 2023-04-05 at 12 38 37]

The idea is a simple, intuitive, visual summary of the smaller set of DQ metrics discussed at W3C. It's a stepping stone into the detail in the validator report.

@nickevansuk (Contributor, Author) commented:
@howaskew looks great! Postcode validation is a great example of a rule that would be helpful in the validator too (centralising logic, etc.).

It's cool having it visible in the visualiser, as data users might be browsing feeds there. I'm thinking about whether setting up the validator to build as a lightweight client-side library might give us the best of both worlds: centralising logic while still having the view in the visualiser...

Or, even easier, we could just store nightly DQ reports, embed them in a tab on the visualiser, and reference them on the status page. That might be even better: one pre-cached source of truth.
